In  Proceedings  of  the  International  Conference  on  Acoustics,  Speech,  and  Signal 
Processing,  Las  Vegas,  Nevada,  March  2008. 

A  PARAMETERIZED  DESIGN  FRAMEWORK  FOR  HARDWARE 
IMPLEMENTATION  OF  PARTICLE  FILTERS 

19  1 

Sankalita  Saha  ,  Neal  K.  Bambha  and  Shuvra  S.  Bhattacharyya 

department  of  Electrical  and  Computer  Engineering,  and  Institute  for  Advanced  Computer  Studies, 
University  of  Maryland,  College  Park,  Maryland,  20742. 

2U.S.  Army  Research  Laboratory, Adelphi,  Maryland,  20783. 

Email:  {ssaha,  ssb}@umd.edu,  nbambha@arl.army.mil 


ABSTRACT 

Particle  filtering  methods  provide  powerful  techniques  for  solving 
non-linear  state-estimation  problems,  and  are  applied  to  a  variety 
of  application  areas  in  signal  processing.  Because  of  their  vast 
computational  complexity,  real-time  hardware  implementation  of 
particle-filter-based  systems  is  a  challenging  task.  However,  many 
particle  filter  applications  share  common  characteristics,  and  the 
same  system  design  can  be  reused  with  appropriate  streamlining. 
To  achieve  this,  a  parameterized  design  framework  for  particle  fil¬ 
ters  is  proposed  in  this  paper.  In  this  framework,  parameterization 
of  system  features  that  vary  over  specific  implementations  enables 
reuse  of  a  generic  design  for  a  wide  range  of  applications  with 
minimal  re-design  effort.  Using  this  framework,  we  explore  differ¬ 
ent  design  options  for  implementing  two  different  particle  filtering 
applications  on  field-programmable  gate  arrays  (FPGAs),  and  we 
present  associated  results  on  trade-offs  between  area  (FPGA 
resource  requirements)  and  execution  speed. 

Index  Terms  —  Field  programmable  gate  arrays,  Parallel  architec¬ 
tures,  Recursive  estimation. 

1.  INTRODUCTION 

Particle  filtering  is  a  powerful,  emerging  methodology  with  a  wide 
range  of  applications  in  science  and  engineering.  Researchers  from 
a  variety  of  fields  ranging  from  signal  processing  to  statistics  and 
econometrics  use  particle  filters  because  of  its  potential  for  coping 
with  difficult  nonlinear  and/or  non-Gaussian  noise  problems.  They 
are  based  on  the  idea  of  approximating  the  probability  density 
functions  (PDFs)  of  the  state  of  a  dynamic  model  by  random  sam¬ 
ples  (particles)  with  associated  weights  and  propagating  them 
across  iterations  based  on  a  probabilistic  model  of  the  state  update 
and  the  measurements.  However,  use  of  particle  filters  in  real-time 
systems  has  been  limited  due  to  their  computational  complexity.  A 
particle  filter  typically  involves  several  complex  mathematical 
operations  that  are  invoked  at  every  iteration  of  the  filter,  as  well  as 
a  large  a  number  of  particles,  which  in  turn  results  in  high  memory 
requirements.  A  possible  solution  for  real-time  implementation  of 
such  systems  is  parallelization  with  the  use  of  multiprocessor  sys¬ 
tems  [2];  but  this  is  also  restricted  because  of  the  presence  of  an 
unavoidable  computing  step  —  resampling  —  which  is  serial  in 
nature,  making  it  difficult  to  fully  parallelize  an  implementation. 
Though  various  efforts  have  been  made  to  derive  distributed  resa¬ 
mpling  algorithms  ([3]),  complete  parallelization  has  not  been  pos¬ 
sible.  This  suggests  the  exploration  customized  solutions. 


Implementation  of  particle  filter  applications  on  hardware  or  hard¬ 
ware/software  codesign  platforms  is  further  challenging  due  to  the 
resource  constraints  on  such  platforms.  Architectural  design  and 
memory  management  for  hardware  implementations  explored 
before  ([1])  mainly  focus  on  one  particle-filter-based  application 
and  optimize  the  various  resulting  subsystems.  Design  and  imple¬ 
mentation  of  a  generic  yet  highly  optimized  architecture  for  all 
particle  filter  based  systems  is  not  possible  because  of  the  wide 
range  of  applications  to  which  particle  filtering  techniques  are 
applied  currently  and  may  be  applied  in  the  future.  But,  there  are 
many  applications  that  share  similarities  as  far  as  particle  filtering 
is  concerned.  A  generic  implementation  framework  that  can  be 
suitably  and  easily  reconfigured  for  different  applications  would 
be  of  significant  utility.  A  limited  form  of  such  a  framework  was 
explored  by  Hong  et  al.,  where  two  particle  filtering  algorithms  are 
implemented  on  the  same  platform,  and  the  system  can  be  config¬ 
ured  to  use  any  one  of  them  through  switching  mechanisms  [5]. 
However,  our  methodology  uses  a  much  higher-level  and  hence 
more  general  approach.  In  particular,  our  methodology  can  be 
applied  to  any  arbitrary  set  of  particle  filtering  algorithms,  rather 
than  being  constrained  to  a  fixed,  pre-defined  set  of  algorithms. 

In  this  paper  a  novel  parameterized  design  framework  to  imple¬ 
ment  particle  filter  based  applications  on  field  programmable  gate 
arrays  (FPGAs)  is  proposed  to  enable  comprehensive  design  space 
exploration  of  the  whole  system  with  attention  to  the  interaction 
between  the  various  subsystems  and  the  different  particle  filtering 
parameter  configurations  that  may  be  used  across  different  appli¬ 
cations.  The  class  of  applications  towards  which  this  work  is 
geared  is  particle  filters  with  underlying  one-dimensional  models. 
Extending  such  an  architecture  to  multi-dimensional  problems  is 
also  of  great  importance  and  is  a  useful  direction  for  future  work. 
Design  and  implementation  details  for  experiments  with  two  appli¬ 
cations  using  the  above  framework  are  provided  to  demonstrate  the 
application  of  the  framework  and  illustrate  its  advantages. 

2.  SYSTEM  DESIGN  FRAMEWORK  FOR  PARTICLE 
FILTER  IMPLEMENTATION 

Particle  filters  provide  a  method  for  recursively  estimating  the 
unknown  state  X,  from  a  collection  of  noisy  observations.  The 
state  parameters  to  be  estimated  depend  on  the  exact  problem 
being  considered.  The  state  transition  and  observation  models  are 

State  Transition  Model:  Xf  =  F(Xf  1?  Wf) ,  and  (1) 
Observation  Model:  Yf  =  G(Xf,  Vt ) ,  (2) 
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where  Wt  is  the  system  noise  and  Vt  is  the  observation  noise.  Xf 
represents  the  dynamically  evolving  state  of  the  system,  and  Yt  is 
the  observation  vector  of  the  system,  which  is  corrupted  by  the 
measurement  noise  Vt  at  instant  t .  The  particle  filter  estimates  the 
state  of  the  system  Xt  and  updates  it  based  on  the  received,  cor¬ 
rupted  observations. 

A  particle  filter  based  system  essentially  consists  of  the  following 
three  computational  steps:  (1)  Sampling :  In  this  step,  samples  (par¬ 
ticles)  of  the  unknown  state  are  generated  based  on  the  given  sam¬ 
pling  function  which  provide  an  estimate  of  the  current  state  of  the 
system  and  also  propagate  the  particles  from  the  previous  time  step 
to  the  current  time;  (2)  Weight  Calculation :  Based  on  the  observa¬ 
tions,  an  importance  weight  is  assigned  to  each  particle  and  (3) 
Resampling :  This  step  involves  redrawing  particles  from  the  same 
probability  density  based  on  some  function  of  the  particle  weights 
such  that  the  weights  of  the  new  particles  are  approximately  equal. 

2.1  Overview 


in  the  functionality  of  any  one  unit.  These  changes  are  handled  by 
the  interfaces  so  that  the  individual  streamlined  units  need  not  be 
redesigned,  which  would  require  significant  effort.  The  PEcores 
perform  the  sampling  operation,  while  a  separate  weight  calcula¬ 
tion  unit  (WU)  is  used  for  calculating  the  weights.  The  PEcore  as 
well  as  the  WU  interact  with  memory  banks  whose  sizes  are 
dependent  on  system  parameters  and  hence  they  can  be  parameter¬ 
ized  as  well.  The  interfaces  provide  parameterized  interaction 
between  the  various  subsystems  such  as  memory  banks  and  the 
resampling  interface  and  perform  synchronization  operations. 

2.2  Design  Framework 

Figure  2  shows  the  overall  design  framework.  We  use  Xilinx’s 
System  Generator  for  design  and  functional  verification  and  the 
Xilinx  ISE  tool-set  for  synthesis.  Xilinx  System  Generator  pro¬ 
vides  a  hardware  library  that  consists  of  various  architectural  units, 
such  as  RAMs  and  adders,  for  modular  design.  It  also  allows  the 
use  of  custom  Verilog  or  VHDL  modules  for  system  design. 


The  architecture  proposed  in  this  paper  for  particle  filters  is  based 
on  the  computational  framework  described  above.  The  wide  range 
of  applications  to  which  particle  filtering  techniques  are  applied 
prohibits  the  focus  on  a  generic  system  architecture  suitable  for  all 
applications.  However,  since  there  exist  wide  ranges  of  applica¬ 
tions  that  use  the  same  particle  filtering  algorithm  with  different 
state  models,  it  is  possible  to  develop  a  generic  architecture  for  a 
subset  of  applications  and  streamline  it  for  specific  applications  in 
this  subset.  The  goal  of  this  framework  is  to  provide  the  user  a  sys¬ 
tematic  approach  for  such  streamlining  —  with  the  ability  to 
explore  the  various  design  tradeoffs  between  area  and  execution 
speed  —  and  provide  the  capability  to  implement  a  wide  range  of 
applications  with  significantly  reduced  re-design  effort. 

To  achieve  this,  a  parameterized  design  framework  is  proposed, 
where  the  overall  system  is  composed  of  small  parameterized  sub¬ 
systems.  Each  such  subsystem  can  be  modified  to  the  needs  of  a 
wide  range  of  applications,  as  well  as  to  final  target  constraints  by 
setting  appropriate  parameters,  such  as  the  memory  size,  and  the 
number  of  particles.  An  overview  of  a  two-processing-element 
configuration  of  our  architecture  is  given  in  figure  1.  The  frame¬ 
work  essentially  consists  of  an  array  of  processing  elements  (PEs), 
and  a  resampling  unit,  along  with  a  set  of  parameterized  interfaces. 
A  PE  consists  of  three  units,  a  PEcore,  a  weight  calculation  unit 
(WU),  and  a  noise  generator.  Each  of  these  units  can  operate  inde¬ 
pendent  of  changes  in  functionality  of  the  other  units.  However, 
the  interaction  between  various  units  can  change  with  the  variation 


As  mentioned  in  section  2.1,  multiple  processing  elements  (PEs) 
for  the  sampling  and  weight  calculation  step  are  used.  Within  a 
given  PE,  further  pipelining  can  generally  be  used,  but  the  degree 
to  which  pipelining  can  be  employed  is  strongly  dependent  on  the 
characteristics  of  the  targeted  application.  The  sampling  and 
weight  calculation  operations  involve  complex  mathematical  oper¬ 
ations,  and  thus  impose  restrictions  on  the  number  of  PEs  that  can 
be  implemented.  The  number  of  particles  handled  by  each  PE  is 

P  =  \P/N  1,  (3) 

where  fjcl  denotes  the  smallest  integer  that  is  greater  than  or 
equal  to  the  real  number  x  ;  P  is  the  number  of  particles;  and  N  is 
the  number  of  PEs.  A  straightforward  memory  management 
scheme  for  particle  storage  and  updating  is  used.  Three  memory 
banks  or  buffers  are  used  for  each  PE  for  storing  (1)  sampled  parti¬ 
cles,  (2)  particle  weights,  and  (3)  resampled  particles.  Since  the 
number  of  memory  banks  that  are  available  on  a  given  platform  is 
limited,  we  have 


N  =  M/3  ,  (4) 

where  M  is  the  number  of  memory  banks  available  on  the  targeted 
FPGA  board.  The  area  consumed  by  the  associated  memory  banks 
directly  depends  on  P  and  N .  The  observation  data  is  stored  in  a 
shared  memory  between  clusters  of  PEs.  The  memory  interface  for 
this  buffer  handles  the  read  requests  from  the  PEs.  The  reading 
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Fig.  1.  Distributed  particle  filter  architecture. 


Fig.  2.  Parameterized  design  framework. 


from  this  memory  for  the  i  th  operation  can  be  overlapped  with 
either  the  resampling  step  of  the  (z  -  1)  th  operation,  the  sampling 
step  of  the  i  th  operation,  or  both.  However,  if  the  system  through¬ 
put  is  greater  than  or  equal  to  the  observation  input  rate,  this  inter¬ 
face  becomes  trivial  as  only  a  single  buffer  is  required. 

There  are  seven  main  interfaces  corresponding  to  the  operations  of 
(1)  observation  data  reading,  (2)  sampled  particle  memory  inter¬ 
facing,  (3)  resampled  particle  memory  interfacing,  (4)  particle 
weight  memory  interfacing  and  (5)  resampling  unit  interfacing. 
Among  these,  the  reading  of  observation  data  is  not  dependent  on 
iV  or  P,  while  the  rest  are  dependent  on  N  and  P .  The  resam¬ 
pling  unit  varies  based  on  the  resampling  scheme  being  used  and  is 
functionally  independent  from  the  rest  of  the  units.  It  is  triggered 
when  all  the  P  particles  have  been  processed  for  a  given  iteration. 
The  resampling  interface  consists  of  a  global  address  generator  and 
a  local  address  generator.  The  global  address  generator  generates 
addresses  for  P  particles  and  depends  on  P .  These  addresses  are 
routed  to  individual  PEs  by  the  local  address  generator,  which, 
thus,  depends  on  both  P  and  N .  In  this  framework,  systematic 
resampling  has  been  used.  However,  this  can  be  easily  replaced 
with  other  sequential  resampling  mechanisms.  Systematic  resam¬ 
pling  is  often  a  preferred  method  due  to  its  computational  simplic¬ 
ity  and  good  empirical  performance.  A  library  of  these 
parameterized  interfaces  and  resampling  schemes  are  created  using 
a  combination  of  Xilinx  System  Generator  hardware  components 
and  custom  HDL  modules. 

The  execution  time  for  resampling  directly  depends  on  P  and  is 
constant  over  all  iterations.  Thus,  the  total  execution  time  (in  terms 
of  clock  cycles)  for  one  iteration  is, 

T  T PECore  ^ resampling  ^  WU ' 

where  Lresampling  is  the  latency  due  to  the  resampling  unit,  Lwu  is 
the  latency  induced  by  the  WU  unit,  and  TPEcore  is  the  execution 
time  of  PEcore,  which  for  a  fully  pipelined  PEcore  is  given  as 

T PEcore  =  L  PEcore  +  •  (6) 

For  systematic  resampling,  the  latency  is  given  by  [1]: 


attained  in  most  cases.  By  varying  the  parameters  appropriately, 
different  implementations  were  obtained  and  various  design 
options  were  explored. 

3.1  Uni-variate  Non-stationary  Growth  Model 

The  first  application  explored  is  an  example  of  a  one-dimensional 
non-linear  system  (typically  studied  in  the  context  of  stochastic 
systems)  [3].  The  model  is  as  follows 

25  xl  . 

Xt  =  0.5  xl;_j  + - 1 -  +  8xcos(1.2x(t-l))+fF  (8) 

1+AVi 

Yt  =  x; V20+V,,  (9) 

where  Wt  and  Vt ,  are  zero-mean  Gaussian  white  noise  with  vari¬ 
ances  10  and  1,  respectively.  The  execution  of  the  PEs  and  the 
resampling  units  is  fully  pipelined.  The  above  equations  were 
mapped  to  appropriate  Xilinx  System  Generator  computation 
blocks  to  build  the  PEcore  and  the  WU.  The  noise  generation  was 
performed  using  Xilinx’s  Gaussian  white  noise  generator.  This 
noise  generator  needs  only  periodic  resetting  to  provide  continuous 
output,  thus  the  PE  interface  did  not  have  to  send  requests  for  data. 
However,  the  initial  latency  of  the  generator  is  10  cycles,  which  is 
present  only  for  the  first  iteration.  Additionally,  Xilinx’s  lookup- 
table-based  cosine  generators  were  used.  These  are  both  fast  and 
inexpensive  (area-efficient)  compared  to  standard  CORDIC  [8] 
cosine  generators.  We  employed  fully-pipelined  multipliers  and 
dividers.  In  the  design,  the  WU  uses  an  exponential  calculation 
unit  that  uses  a  combination  of  a  look-up  table  and  a  polynomial 
approximation  method.  Uniform  random  number  generation  for 
resampling  is  done  using  multiple-bit,  leap-forward  linear  feed¬ 
back  shift  registers  (LFSRs)  [4].  Parameterized  interfaces  were 
used  to  build  the  interconnections  between  the  various  subsystems. 

3.2  Uni-dimensional  Failure  Prognosis  Model 

This  practical  particle  filtering  application  is  adapted  from  [6], 
where  particle  filtering  is  used  to  track  crack  faults  in  the  blades  of 
a  turbine  engine.  The  fault  growth  model  is  given  by 


L resampling  =  2xP-l  .  (7) 

This  signifies  that  the  latency  of  the  resampling  unit  increases 
directly  with  an  increase  in  number  of  particles,  and  thus  the 
latency  will  generally  become  a  bottleneck  for  applications  requir¬ 
ing  very  high  P .  For  the  first  processing  iteration,  any  initial 
latency  that  exists  should  be  added  to  the  latency  model  of  (equa¬ 
tion  6).  Such  initial  latency  may  exist,  for  example,  because  of  ini¬ 
tial  latency  of  the  noise  generator.  For  further  details  the  reader  is 
referred  to  [7]. 

3.  EXPERIMENTS  AND  RESULTS 

In  this  section,  implementations  for  two  different  particle  filter 
problems  using  the  proposed  framework  are  demonstrated  along 
with  corresponding  experimental  results.  The  two  systems  were 
designed  and  synthesized  using  Xilinx  System  Generator  9.1  and 
Xilinx  ISE  9.1.  The  target  device  family  was  the  Xilinx  Virtex  4 
FPGA.  Although  the  FPGA  board  used  in  the  experiments  could 
support  a  clock  frequency  of  500  MHz,  this  frequency  could  not  be 


X 


1  (A-i+A-i+A-i+^-i  +  i) 

Yt  =  Xt+Vt, 


(10) 

(ii) 


Fig.  3.  Percentage  decrease  in  execution  time  (1  iteration)  for 
uni-variate  non-stationary  growth  model  implementation. 


where  Wt  and  Vt  are  zero-mean,  Gaussian  white  noise  with  vari¬ 
ances  1 0  and  1 ,  respectively.  The  PEcore  comprised  of  fully  pipe¬ 
lined  multipliers  and  a  divider,  all  of  which  are  fully  pipelined 
cores  from  Xilinx.  The  Gaussian  white  noise  generator,  exponen¬ 
tial  calculation  unit,  and  uniform  random  number  generator  used  in 
section  3.1  are  reused  again.  The  resampling  unit  and  interfaces 
were  selected  from  the  library  and  design  space  exploration  is  done 
by  varying  P  and  N . 

3.3  Results 

The  percentage  decreases  in  execution  times  compared  to  serial 
execution  are  shown  in  figures  3  and  4  for  the  various  design  cases 
while  the  tracking  results  are  shown  in  figure  5.  The  results  shown 
are  for  one  iteration  at  steady  state  —  i.e.,  not  the  first  iteration, 
where  there  is  additional  latency  due  to  the  Gaussian  white  noise 
generator.  The  resource  utilizations  of  the  various  implementations 
(200  particles)  are  shown  in  table  1  (the  DSP48  utilization  is  same 
for  both  the  applications  due  to  their  similar  nature).  The  block 
RAM  (BRAM)  memory  banks  available  for  the  Virtex  4  device 
family  are  each  of  size  18Kb,  which  is  much  higher  than  what  is 
required  for  any  of  the  implementations.  Increasing  P  affects  only 
the  required  memory  bank  sizes,  thus  the  resource  utilization 
remains  the  same  for  different  numbers  of  particles.  However,  for 
applications  with  larger  memory  requirements,  this  would  not  be 
the  case.  Note  that  the  execution  times  for  both  of  the  applications 
are  similar  because  the  latencies  of  the  PEs  are  relatively  small 
compared  to  the  latency  induced  by  P  . 

4.  CONCLUSIONS 

In  this  paper,  a  new  methodology  for  design,  modeling  and  explo¬ 
ration  for  particle  filters  on  reconfigurable  system-on-chips  (SoCs) 
has  been  presented.  Our  methodology  uses  the  notion  of  parame¬ 
terization  to  provide  a  useful  tool  for  evaluating  multiple  design 
alternatives,  and  exploring  the  associated  trade-offs  in  an  efficient 
and  intuitive  manner.  It  also  provides  scope  for  implementing  a 
wide  range  of  applications  with  minimal  redesign  effort  between 
different  applications.  For  both  of  the  applications  that  we  exam¬ 
ined,  the  execution  speed  was  determined  mainly  by  the  number  of 
particles,  and  thus,  the  latency  of  the  resampling  unit  played  a  sig¬ 
nificant  role  in  determining  the  overall  execution  time.  This 
stresses  the  need  to  look  further  into  methods  for  optimizing  this 
unit.  Although  multiple  expensive  (area-consuming)  computa¬ 
tional  units  were  used,  the  area  constraint  imposed  by  the  target 
platform  was  met.  The  large  underutilization  of  the  block  RAMs 


Fig.  4.  Percentage  decrease  in  execution  time  (1  iteration)  for 
uni-dimensional  failure  prognosis  model  implementation. 


Table  1.  FPGA  Resource  utilization  for  uni-variate  non-sta- 
tionary  growth  model  (Applcn  1)  and  unidmensional  failure 
prognosis  model  (Applcn  2)  implementation. 


No.  of 
PEs 

Slices 

Slice  Flip-Flops 

4  input  LUTs 

DSP48S 

BRAMs 

Applcnl 

Applcn2 

Applcnl 

Applcn2 

Applcnl 

Applcn2 

2 

29.17% 

19.81% 

8.54% 

6.13% 

19.91% 

14.27% 

43.23% 

15.63% 

3 

42.25% 

28.39% 

12.56% 

8.94% 

28.6% 

20.18% 

60.93% 

23.44% 

5 

68.97% 

45.57% 

20.61% 

14.57% 

45.73% 

31.98% 

96.35% 

39.06% 

indicates  potential  for  extending  this  architecture  to  systems  with 
higher  memory  requirements,  and  this  is  an  interesting  direction 
for  future  work. 
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